THE OFFICE
TEXT ANALYSIS
This project's main purpose is to analyze a TV show in a reliable and measurable way, without the need to watch the whole show or rely on a personal perspective. The selected subject for this analysis is the sitcom "The Office", which was chosen mainly for the high availability of data.
This notebook uses the previously collected and cleaned data to carry out the analysis, with the same goal: to analyze a TV show in a reliable and measurable way, without watching the whole show or relying on a personal perspective.
The main questions this analysis will try to answer are described below:
Who are the main characters?
This question can be defined as a descriptive question, where the analysis will use simple descriptive statistics to identify the most relevant characters.
How do they communicate? This is an explanatory question, where we'll combine descriptive statistics with natural language processing methods to explain the relations between the characters, the polarity of their dialogs, and which words and terms could be used to describe them.
The selected show for this project was a sitcom from NBC, named The Office.
This TV series was aired from 2005 to 2013 and is still one of the most-watched shows on Netflix.

IMDb describes the show as:
"A mockumentary on a group of typical office workers, where the workday consists of ego clashes, inappropriate behavior, and tedium."
The criteria for selecting this TV show are:
The data collection process used two scripts; their main functionalities are listed below:
The_Office_Scraper.py
Create a CSV file to store the collected data;
The CSV contains the fields:
IMDB_Scraper.py
Create a CSV file to store the collected data;
The CSV file contains the fields:
The project was built with Python 3.7 and uses a mix of Python scripts and Jupyter notebooks.
In the previous notebooks we mostly used Pandas, NumPy, BeautifulSoup, NLTK, and VADER to collect, clean, and prepare the data.
The libraries used in this notebook are:
Pandas
for data wrangling;
NumPy, scikit-learn, SciPy, and spaCy
for mathematical, statistical and machine-learning related tasks;
NetworkX, Matplotlib, and Seaborn
for visualizations;
# install required libraries
import sys
# !{sys.executable} -m pip install numpy
# !{sys.executable} -m pip install pandas
# !{sys.executable} -m pip install scipy
# !{sys.executable} -m pip install scikit-learn
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m pip install matplotlib
# !{sys.executable} -m pip install seaborn
# !{sys.executable} -m pip install networkx
import json
import scipy
import spacy
import pandas as pd
import numpy as np
import networkx as nx
import seaborn as sb
import matplotlib.pyplot as plt
from math import pi
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.interpolate import make_interp_spline, BSpline
from IPython.core.display import display, HTML
from IPython.display import IFrame
# download model for spaCy
# python -m spacy download en_core_web_sm
import en_core_web_sm
# configure backend to increase the visualizations resolution
%config InlineBackend.figure_format ='retina'
# load data
df = pd.read_csv('data/the_office_features.csv', sep=';', encoding='utf-16')
ratings = pd.read_csv('data/ratings.csv', sep=';', encoding='utf-16')
print('Main dataset, first 5 rows:')
df.head()
The first step in analyzing the show is to define who the main characters are.
This may have several different interpretations, but for this project we are considering some metrics to do so. The points to be considered are:
The number of dialogs and episodes is the main indicator of who the main characters are, since the characters with the largest share of dialogs who appear in the most episodes receive more attention and therefore should be the main characters.
There's a challenge here because of special guests and characters who had great importance for a short time. Those characters appear to have lots of dialogs and participate in many episodes, but they're only around for a couple of seasons at most; this is why we're also considering the seasons when calculating the main-character score.
To solve this issue I developed a score that considers all the above-mentioned approaches.
We start by aggregating the numerical fields and getting their respective descriptive statistics, such as means, standard deviations, medians, and other aggregations. This will help us calculate the scores, but will also serve as the basis for all the other analyses comparing the characters.
# Build a new data frame with aggregated measures
def build_df(temp):
    temp_describe = temp.describe()
    chars = temp_describe.index.to_list()
    new_df = pd.DataFrame(chars)
    new_df.columns = ['chars']
    # count of dialogs
    new_df['dialogs'] = temp_describe['id']['count'].values
    # Words
    new_df['avg_words'] = temp_describe['words_qty']['mean'].values
    new_df['std_words'] = temp_describe['words_qty']['std'].values
    new_df['25%_median_words'] = temp_describe['words_qty']['25%'].values
    new_df['50%_median_words'] = temp_describe['words_qty']['50%'].values
    new_df['75%_median_words'] = temp_describe['words_qty']['75%'].values
    # Sentences
    new_df['avg_sentences'] = temp_describe['sentences_qty']['mean'].values
    new_df['std_sentences'] = temp_describe['sentences_qty']['std'].values
    # Sentiment analysis
    new_df['positive'] = temp_describe['positive']['mean'].values
    new_df['neutral'] = temp_describe['neutral']['mean'].values
    new_df['negative'] = temp_describe['negative']['mean'].values
    new_df['compound'] = temp_describe['compound']['mean'].values
    # total words, number of seasons and number of episodes
    new_df['total_words'] = temp.sum()['words_qty'].values
    new_df['unique_s'] = temp['season'].nunique().values
    new_df['unique_ep'] = temp['ep_seas'].nunique().values
    return new_df
# Similar to the previous method, but builds a single dataframe per character
def build_char_df(name):
    temp = df[df['name'] == name].groupby(by='episode_name')
    char_df = pd.DataFrame(temp.count().index)
    char_df.columns = ['ep_name']
    char_df['dialogs'] = temp.count()['text'].values
    char_df['mean_sent'] = temp.mean()['sentences_qty'].values
    char_df['mean_words'] = temp.mean()['words_qty'].values
    char_df['mean_positive'] = temp.mean()['positive'].values
    char_df['mean_negative'] = temp.mean()['negative'].values
    char_df['mean_neutral'] = temp.mean()['neutral'].values
    char_df['mean_compound'] = temp.mean()['compound'].values
    char_df['total_sent'] = temp.sum()['sentences_qty'].values
    char_df['total_words'] = temp.sum()['words_qty'].values
    ratings['ep_name'] = [x.upper() for x in ratings['ep_name']]
    char_df = char_df.merge(ratings,
                            how='right',
                            left_on='ep_name',
                            right_on='ep_name')
    char_df.drop(['Unnamed: 0'], axis=1, inplace=True)
    char_df = char_df.sort_values(by=['season', 'ep_num']).fillna(0)
    return char_df
This is the indicator I developed to help to find the main characters of the series and classify them by relevance to the show.
Score = nep + (nd / nep) * (ns / 5)
Where:
nep = number of episodes;
nd = number of dialogs;
ns = number of seasons;
5 is a threshold I used.
The idea is to "penalize" characters that appeared in fewer than 5 seasons (approx. half the series) and give more significance to characters that appeared in more than 5 seasons.
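As a sanity check, the score can be computed for two hypothetical characters (the names and numbers below are illustrative, not taken from the dataset): a long-running cast member and a special guest with the same total dialog count.

```python
# Hypothetical worked example of the relevance score (illustrative numbers only)
def relevance_score(n_episodes, n_dialogs, n_seasons, threshold=5):
    # Score = nep + (nd / nep) * (ns / threshold)
    return n_episodes + (n_dialogs / n_episodes) * (n_seasons / threshold)

# A regular cast member: 100 episodes, 500 dialogs, 9 seasons
regular = relevance_score(100, 500, 9)  # 100 + 5 * 1.8 = 109.0
# A special guest: 10 episodes, 500 dialogs, 1 season
guest = relevance_score(10, 500, 1)     # 10 + 50 * 0.2 = 20.0
print(regular, guest)
```

Even with the same number of dialogs, the guest's short run keeps the score low, which is exactly the penalty described above.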
temp = df.groupby('name')
top_chars = build_df(temp)
dialog_avg = top_chars['dialogs'] / top_chars['unique_ep']
score = top_chars['unique_ep'] + dialog_avg * (top_chars['unique_s'] / 5)
top_chars['score'] = score
all_chars = top_chars.sort_values(by='score', ascending=False)[15:]
top_chars = top_chars.sort_values(by='score', ascending=False)[:15]
top_chars
#save the grouped version of the dataset
#top_chars.to_csv('the_office_main_chars.csv', sep=';', encoding='utf-16', index=False)
#plot the main characters and their scores
top_chars = top_chars.sort_values(by='score', ascending=True)
fig, ax = plt.subplots(1,figsize=(16,6))
plt.barh(top_chars['chars'], top_chars['score'])
plt.title('Most relevant characters')
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='x', linestyle='--')
With help from theoffice.fandom.com, the 'wiki' page for the series, we can outline some information about the characters and compare those that had similar scores.
Michael is the manager of the office and, according to our score, the lead character of the show, followed by Dwight, who is the 'Assistant to the Regional Manager' for most of the show.
The list follows with:
Jim and Pam, who play a romantic couple in the show.
Kevin, Angela, and Oscar (after Andy), who all work in accounting and sit close to each other.
Andy, whose score falls among the accounting people's, is a salesman who entered the show later.
Phyllis and Stanley, who are both salespeople and sit across from each other.
Ryan and Kelly, who also play a romantic couple in the show.
Meredith and Creed, who have a more cartoonish aspect to their characters.
Darryl, who is in the show from the beginning but doesn't appear much in the earlier seasons, since he works below the office in the warehouse.
top_chars = top_chars.sort_values(by='score', ascending=False)
top_chars[['chars','unique_ep', 'dialogs', 'unique_s', 'score']]
Something that stands out in the listing above is the number of seasons: while most characters selected by the score participated in all nine seasons, three characters didn't.
Those characters are Michael, Andy, and Creed. I decided to research why these characters have fewer seasons than the other main characters, and this led to some interesting information.
Andy only entered the show in the third season.
Creed, according to theoffice.fandom, appeared in the background of the first season but didn't have any dialogs.
Michael, who achieved the highest score even with one fewer season, left the show in the middle of the seventh season and returned only for the last episode, so he didn't participate at all in the eighth season.
To test our score we can compare the distributions of our selected variables.
In the chart below the values are displayed as:
The chart compares the main characters (red) selected by the score with all the other characters (blue).
fig, ax = plt.subplots(1, figsize=(16, 8))
# plot all characters
plt.scatter(all_chars.unique_ep,
all_chars.dialogs,
linewidths=all_chars.unique_s,
label=top_chars.chars)
# plot main characters
plt.scatter(top_chars.unique_ep,
top_chars.dialogs,
linewidths=top_chars.unique_s,
label=top_chars.chars,
color='red')
# set chart details
plt.title('Characters Dialogs and Number of Episodes')
plt.legend(['All Characters', 'Main Characters'])
plt.xlabel('Episodes')
plt.ylabel('Dialogs')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
We can see that the score works properly for selecting the characters with the most episodes and seasons. All the characters on the right side of the chart, with a high number of episodes, were selected by our algorithm.
We can also see how the score handles the 'average dialogs'; in the chart below we have:
fig, ax = plt.subplots(1, figsize=(16, 8))
plt.scatter(all_chars.dialogs / all_chars.unique_ep,
all_chars.dialogs,
linewidths=all_chars.unique_s,
label=top_chars.chars)
plt.scatter(top_chars.dialogs / top_chars.unique_ep,
top_chars.dialogs,
linewidths=top_chars.unique_s,
label=top_chars.chars,
color='red')
plt.title('Characters Dialogs, Totals and Averages')
plt.legend(['All Characters', 'Main Characters'])
plt.xlabel('Average Dialogs')
plt.ylabel('Total Dialogs')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
We can see from the above chart that the number of seasons was great for separating characters, selecting not just the ones with a high average of dialogs per episode but also the ones who participated throughout the whole show.
The main characters are:

*The above list is for visual display of the characters and is not sorted.
A very interesting characteristic we can analyze is the number of words and sentences a character says. The concept behind this analysis is that if you have a high average in those numbers, you are either too subjective or you have lots to say.
One hypothesis is that you have too much or too complex information to communicate, in which case you would have lots to say. The alternative is that you don't have much to communicate but use too many words for it, which would mean you're being subjective.
In the chart below, the blue bars represent the means and the black lines the standard deviations. The problem in this case is that there's a huge difference between the means and the standard deviations. This means our data has extreme outliers, so the averages are not such a good indication of who talks more or less; they only give us a slight idea of it.
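The outlier effect is easy to reproduce with a toy sample (hypothetical numbers): a single long monologue is enough to push the standard deviation past the mean.

```python
from statistics import mean, stdev

# Typical short dialog lengths plus one long monologue (hypothetical numbers)
words_per_dialog = [6, 7, 5, 8, 6, 120]
m = mean(words_per_dialog)
s = stdev(words_per_dialog)  # sample standard deviation
print(m, s)  # the standard deviation greatly exceeds the mean
```

One extreme value dominates both statistics, which is why the medians in the next charts are a more robust complement.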
top_chars = top_chars.sort_values(by='avg_words', ascending=True)
fig, ax = plt.subplots(1, figsize=(16, 6))
plt.barh(top_chars['chars'],
top_chars['avg_words'],
xerr=top_chars['std_words'])
plt.xlim([0, 31])
plt.title('Average words in dialog')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='x', linestyle='--')
Since the means don't give us the full picture, we can analyse the medians for those characters.
top_chars = top_chars.sort_values(by='50%_median_words', ascending=True)
fig, ax = plt.subplots(1, figsize=(16, 6))
plt.bar(top_chars['chars'], top_chars['50%_median_words'])
plt.title('Median words in a dialog')
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='y', linestyle='--')
Medians complement our understanding: Michael and Andy still have the highest numbers, Dwight and Kelly switched places as third and fourth, most characters share the same median of 6, and Kevin apparently says fewer words than all the others.
We can say this reinforces the importance of Michael and Dwight as leading characters, but why does Andy have the second-highest average and median, and not Dwight?
Andy's high number of dialogs and word counts suggest that even though he entered the show later, he was very important to its plot.
Some other insights we can support with this data are:
Talking a lot is actually one of Kelly's personality traits, and this is commented on and joked about throughout the series.
display(IFrame("https://www.youtube.com/embed/VSv64fV0LDk?start=87", width="560", height="315"))
Kevin, of all the main characters, has the lowest median number of words per dialog, but this doesn't mean he's extremely objective or concise in his conversations.
display(IFrame('https://www.youtube.com/embed/_K-L9uhsBLM', width="560", height="315"))
There are plenty of different methods for sentiment analysis; among the most evident are pre-trained machine-learning models, ready-to-go algorithms that can classify texts into positive, negative, and neutral.
The problem found with all the pre-trained models researched is that they are trained either on social media data or on product review/rating data. Those means of communication differ a lot from the data we're analysing, so it wouldn't be appropriate to use them. Besides the pre-trained models, there are a few other open-source algorithms for training our own models; the biggest problem with that would be labeling our data, which would require too much time and resources.
The best solution found to satisfy the needs of the project was VADER.
(Valence Aware Dictionary and sEntiment Reasoner) https://github.com/cjhutto/vaderSentiment
VADER uses a dictionary to assign scores to words, while considering their location within the text and the punctuation, to score the document with a proportion of each sentiment contained in it. Those sentiments are named negative, positive, and neutral.
After getting the proportions for each sentiment VADER calculates a compound. The compound is a normalized sum of all proportions, from -1 (completely negative) to 1 (completely positive).
Some of the semantic contexts considered by VADER are:
Conjunctions E.g.: 'I like your X, but your Y is very bad';
Negation Flips E.g.: 'This is not really the greatest';
Degrees E.g.: 'This is good' vs 'This is extremely good';
Capitalization E.g.: 'this is GREAT' vs 'this is great';
Punctuation E.g.: 'this is great!!!' vs 'this is great';
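The compound score mentioned above comes from a simple normalization of the summed word valences; a minimal sketch of that step (alpha = 15 is the library's default constant):

```python
import math

def vader_normalize(valence_sum, alpha=15):
    # Maps an unbounded sum of word valences into the open interval (-1, 1)
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

print(vader_normalize(4.0))   # strongly positive, close to 1
print(vader_normalize(-2.0))  # moderately negative
print(vader_normalize(0.0))   # neutral
```

This is why the compound score stays bounded no matter how long or emphatic a dialog is.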
Visualize Polarities
The visualization of the results aims at displaying the characters of the show, and the average positive and negative sentiments for each of them.
For a fair perspective of those values we're comparing them in the same scales where:
So the range of 0.04 to 0.23 is applied to both the x and y axes.
fig, ax = plt.subplots(1, figsize=(8, 8))
# plot all characters sentiment
plt.scatter(top_chars['positive'], top_chars['negative'], marker='x')
# set grid lines
ax.axhline(0.14, linestyle='--', color='grey')
ax.axvline(0.14, linestyle='--', color='grey')
# set limits
plt.xlim([0.04, 0.23])
plt.ylim([0.04, 0.23])
# plot labels and title
plt.xlabel('Positive')
plt.ylabel('Negative')
plt.title('Main Characters\nAverage Sentiment Polarity')
outliers = ['Angela', 'Stanley', 'Meredith']
for i, name in enumerate(top_chars['chars'].values):
    if name in outliers:
        position = top_chars[['positive', 'negative']].values[i]
        ax.annotate(name, position)
We can see that most characters behave similarly in terms of dialog polarity; the values concentrate at high positive and low negative for the vast majority of them, but we can also see some outliers away from the group.
As mentioned before, most of the characters have a high positive score of around 0.14 to 0.20, with a low negative score of 0.06 to 0.08.
But we can note some characters with higher negative scores and also a character with a lower positive score.
Stanley is the most distant from the other characters; he has a relatively low positive score, but his negative score isn't very high either.
This means his dialogs are mostly neutral, almost as if he doesn't want to get involved.
display(IFrame("https://www.youtube.com/embed/bqAhJcSQQG4?start=334" , width="560", height="315"))
fields = [
'chars', 'dialogs', 'avg_words', 'positive', 'neutral', 'negative',
'compound', 'unique_s', 'unique_ep', 'score'
]
top_chars[top_chars['chars'].isin(['Angela', 'Stanley', 'Meredith'])][fields]
Most of the show revolves around positive dialogs.
Some outliers have a lower amount of positive dialogs and some have a higher amount of negative dialogs, but even they had, overall, more positive dialogs than negative ones.
The file 'conversations.json' contains one record for every scene in the show, where each record contains the names of the characters that had some dialog in the scene and the respective number of dialogs each character had.
These conversations will be used to calculate a score for the relations between the characters.
file = open('data/conversations.json')
conversations = file.read()
conversations = json.loads(conversations)
print('first 5 rows:')
conversations[:5]
In order to compare the relationships between the characters, the following formula was developed:
Score = min(nx, ny) / max(nx, ny)
Where:
nx = number of dialogs character x had in a conversation;
ny = number of dialogs character y had in a conversation;
This score is based on the concept that a perfectly balanced conversation will have the same amount of dialogs between both agents.
E.g.: A conversation with three characters x, y and z;
Where x said 5 dialogs, y said 5 dialogs, and z said 1 dialog will result in a score between x and y of 1, while the score between x and z will be 0.2.
The scores are then summed with all scores from the same relation so they can be compared. It's important to note that this results in generally higher scores for characters that communicate a lot and lower scores for characters that don't.
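The three-character example above can be checked numerically with a small hypothetical helper (the same smaller-to-larger ratio the score describes):

```python
def balance_score(nx, ny):
    # Ratio of the smaller dialog count to the larger one
    return min(nx, ny) / max(nx, ny)

# x said 5 dialogs, y said 5, z said 1
print(balance_score(5, 5))  # 1.0 (perfectly balanced)
print(balance_score(5, 1))  # 0.2
```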
# relationship score calculation
def calc_score(a, b):
    if a < b:
        a, b = b, a
    return b / a
scores = []
names = []
for name_A in top_chars['chars'].unique():
    for name_B in top_chars['chars'].unique():
        score = 0
        if name_A == name_B:
            continue
        for talk in conversations:
            if name_A in talk and name_B in talk:
                score += calc_score(talk[name_A], talk[name_B])
        scores.append(score)
        names.append([name_A, name_B])
# build a dataframe with the main character names
df_rel = pd.DataFrame(top_chars['chars'].unique())
df_rel.columns = ['names']
# fill the dataframe with 0s
for name in top_chars['chars'].unique():
    df_rel[name] = np.zeros(len(top_chars['chars'].unique()))
# set name as index
df_rel.set_index('names', inplace=True)
# store scores to their respective rows
for i, n in enumerate(names):
    # row = second name, column = first name
    df_rel.loc[n[1], n[0]] = scores[i]
After calculating the relationship scores for every character of the show we have the following data:
df_rel
At this point we'll start comparing the relationships and describing them as 'strong' or 'weak', depending on the value of their scores. It's important to note that a strong relationship in this context doesn't relate to the sentiment involved between the characters, so it won't necessarily be a positive relation.
In this context, a strong relationship means the characters communicate a lot.
# Build an array for masking the repeated values
mask = []
for i in np.arange(len(top_chars['chars'].unique())):
    mask.append(
        np.concatenate((np.ones(i + 1, dtype=bool),
                        np.zeros(len(top_chars['chars'].unique()) - i - 1,
                                 dtype=bool))).tolist())
# plot heatmap
fig, ax = plt.subplots(1, figsize=(16, 8))
sb.heatmap(df_rel,
annot=True,
fmt="g",
cmap='RdBu',
mask=mask,
vmin=0,
vmax=500)
plt.show()
By themselves the scores are already very meaningful; we can tell that Pam and Jim have the strongest relationship of all.
We can also notice that Michael, the main character of the show, has an overall higher score with everybody when compared to 'lower-ranked' main characters such as Meredith, Creed, or Darryl.
This makes sense from the perspective that Michael has been communicating more constantly with everybody in the show, so he probably has a stronger relationship with most characters.
To extract even more information about the relationships we can normalize the scores. In this case we'll do so by standardizing the values, i.e. calculating their z-scores. This will allow us to see how many standard deviations away from the mean each relation is.
Simplifying: we want to see how extreme those relationships are for each character.
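A minimal sketch of the standardization on toy numbers (the `statistics` module's `stdev` is the sample standard deviation, matching pandas' default `ddof=1`):

```python
from statistics import mean, stdev

# z = (x - mean) / std, computed over a toy column of scores
scores = [10.0, 20.0, 30.0]
m, s = mean(scores), stdev(scores)
z_scores = [(x - m) / s for x in scores]
print(z_scores)  # [-1.0, 0.0, 1.0]
```

A z-score of 0 is an average relationship; values beyond roughly ±2 are the extreme ones we want to spot in the heatmap.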
# Z-Scores
df_nor = (df_rel - df_rel.mean()) / df_rel.std()
# mask to remove results between the same character
mask = []
for i in np.arange(len(top_chars['chars'].unique())):
    mask.append(
        np.concatenate((np.zeros(i, dtype=bool), np.array([1], dtype=bool),
                        np.zeros(len(top_chars['chars'].unique()) - i - 1,
                                 dtype=bool))).tolist())
# plot standarized values heatmap
fig, ax = plt.subplots(1, figsize=(22, 8))
sb.heatmap(df_nor, annot=True, fmt="g", cmap='RdBu', mask=mask)
ax.invert_yaxis()
ax.xaxis.tick_top()
plt.show()
One way to improve this visualization is by showing the actual p-values; they represent how likely it would be to find values at least that extreme in the distribution.
In this case we'll look for relationships with a p-value lower than 0.05, corresponding to a 95% confidence level that those relationships differ in a statistically significant way from the average relationships of the analysed characters.
df_p_values = scipy.stats.norm.sf(abs(df_nor)) * 2
fig, ax = plt.subplots(1, figsize=(22, 8))
sb.heatmap(df_p_values, annot=True, fmt="g", cmap='Blues_r', mask=mask)
ax.invert_yaxis()
ax.xaxis.tick_top()
plt.xticks(np.arange(0.5, 15.5), df_rel.columns)
plt.yticks(np.arange(0.5, 15.5), df_rel.columns, rotation=0.9)
plt.show()
With 95% confidence, the relationships listed below had a higher conversation score than the average relationship.
Michael -> Dwight
Dwight -> Michael
Dwight -> Jim
Jim -> Dwight
Jim -> Pam
Pam -> Jim
Angela -> Dwight
Andy -> Dwight
Darryl -> Andy
Ryan -> Michael
Stanley -> Phyllis
# build dictionary with lists of 'from' and 'to', for plotting the network graphs
from_to = {
'from': [
'Michael', 'Dwight', 'Dwight', 'Jim', 'Jim', 'Pam', 'Angela', 'Andy',
'Darryl', 'Stanley', 'Ryan'
],
'to': [
'Dwight', 'Michael', 'Jim', 'Dwight', 'Pam', 'Jim', 'Dwight', 'Dwight',
'Andy', 'Phyllis', 'Michael'
]
}
# build a data frame from the dictionary
df_net = pd.DataFrame(from_to)
Visualize the strongest relationships in a network chart
# plot network chart
fig, ax = plt.subplots(1, figsize=(16, 8))
G = nx.from_pandas_edgelist(df_net, 'from', 'to', create_using=nx.DiGraph())
nx.draw(G,
with_labels=True,
node_size=3500,
alpha=0.5,
arrows=True,
linewidths=1,
font_size=15,
pos=nx.circular_layout(G))
plt.title("Relationships")
Word and term frequencies can give us an interesting perspective on how the characters communicate and what the show is about.
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
char_text = ' '.join(df.clean_txt.astype(str).values)
wordcloud = WordCloud(width=1800,
height=1800,
background_color='#293F3F',
colormap="Reds",
max_font_size=250).generate(char_text.upper())
fig = plt.figure(figsize=(15, 15), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
We can see in the above visualization that many of the words relate to people; words such as names and pronouns are very common in their daily communications. We can also see that many of those words have little to no meaning by themselves.
To improve on that, we can check which terms are distinctive for each character; in other words, we'll remove words that are common to all characters and focus on the words that are specific to each of the main characters.
Term Frequency - Inverse Document Frequency (TF-IDF) is a method that compares how many times a term appears in a document with how many documents the term appears in.
TF-IDF = TF * IDF
TF = Frequency(term)
IDF = log(Number of documents / Number of documents containing the term)
After calculating the TF-IDF scores we take the difference between the mean score across all characters and each character's score; this shows how far above or below average each word is for that character.
The result is then sorted to get each character's most above-average words.
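A hand-computed example of the plain TF-IDF formula above (the toy documents are hypothetical; note that sklearn's TfidfTransformer uses a smoothed IDF and normalizes rows by default, so its exact numbers differ slightly):

```python
import math

documents = [
    "bears beets battlestar galactica",
    "beets are a superfood",
    "the office is in scranton",
]
term = "beets"
tf = documents[0].split().count(term)                     # 1 occurrence in doc 0
doc_freq = sum(term in doc.split() for doc in documents)  # the term is in 2 docs
idf = math.log(len(documents) / doc_freq)                 # log(3 / 2)
tfidf = tf * idf
print(round(tfidf, 3))  # 0.405
```

A term used by every character gets an IDF of log(1) = 0, which is exactly why common filler words drop out of the per-character rankings.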
# Prepare a dataframe
# get texts by character
df_txt = df[df.name.isin(top_chars.chars.values)]
# group the texts for each character in a single string
all_txt = []
for char in df_txt.name.unique():
    temp_df = df_txt[df_txt.name == char]
    temp_txt = []
    for i, row in temp_df.iterrows():
        temp_txt.append(str(row.clean_txt))
    all_txt.append(' '.join(temp_txt))
# Create a dataframe then add main characters and texts to it
df_txt = pd.DataFrame(df_txt.name.unique())
df_txt.columns = ['name']
df_txt['text'] = all_txt
# use a vectorizer to build a sparse matrix that'll hold every word count for each character
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df_txt.text)
# Check that the word 'business' is in the learned vocabulary;
# note vocabulary_.get returns the word's column index, not its count
test = count_vect.vocabulary_.get(u'business')
print("*test*\nColumn index of the word 'business': " + str(test))
# calculate tfidf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# convert matrix to array and, get feature names from vector
df_words = pd.DataFrame(X_train_tfidf.toarray(),
columns=count_vect.get_feature_names())
# build data frame
df_words['word'] = df_txt.name.unique()
df_words.set_index('word', inplace=True)
df_words = df_words.transpose()
df_words['sum_score'] = df_words.sum(axis=1)
# mean over the 15 main characters (computed from sum_score so the new column isn't double-counted)
df_words['mean_score'] = df_words['sum_score'] / 15
print('last 5 rows:')
df_words.tail()
The 10 most distinct words by character
# Build a dataframes with the top 10 more distinguishable words for each character
df_tfidf = pd.DataFrame(np.arange(1, 11))
for name in top_chars.chars.values:
    temp = (df_words[name] - df_words['mean_score'])
    temp = temp.sort_values(ascending=False)[:10]
    df_tfidf[name] = temp.index.tolist()
    df_tfidf[name + '_values'] = temp.values
df_tfidf[top_chars.chars.values]
We can see some patterns in the distinctive terms, such as every character having some person they mention in a distinctive way, with a high frequency and more than the other characters do.
This means a great deal of the show is spent talking about people and personal relationships.
A better way to use this table is to analyse the characters individually and research the combinations of 'Character name' + 'Term'.
This can lead to a more detailed understanding of the characters. Some are more explicit, like Pam, for example: her distinctive words include 'Mural', 'Paint', and 'Art', clear indications of her interest in art.
With other characters we can find more subjective information, like Meredith: in her list we can see words such as 'Lice', 'Alcoholic', and 'Vagina'. Those words may be considered too intimate or even uncomfortable, and this is the essence of her character.
display(IFrame("https://www.youtube.com/embed/fzn1C2LNCBQ?start=327" , width="560", height="315"))
The Office is a show where the characters are mostly talking about other people and their relationships.
We can also conclude that the TF-IDF scores provide a quick and easy direction for researching on the characters for more details about their personalities.
One of the many ways of breaking down all this data is by analysing the characters individually; from this point on, the previously discussed methods will be adapted for a single character.
Besides the previously seen data, in this section we'll also explore the ratings.
# Define the character to be analysed
name = 'Michael'
myplot = build_char_df(name)
print('The options are:')
print(top_chars['chars'].values)
print('\nSelected character is: ' + name)
The polarity scores for each dialog were generated by VADER, please consult the previous section 'Sentiment Analysis' or the Data Cleaning and Preparation Notebook for more information about this method and its implementation.
The sentiment analysis displays high amounts of Neutral interactions and low amounts of negative and positive for most characters. To better visualize the small differences between those scores we can normalize them.
df_radar = top_chars[['chars', 'positive', 'neutral', 'negative']].copy()
df_radar.columns = ['chars', 'POS', 'NEU', 'NEG']
df_radar.set_index('chars', inplace=True)
normalized_df = df_radar
# normalize
# z-score = ( x - mean ) / standard deviation
normalized_df = (df_radar - df_radar.mean()) / df_radar.std()
# option 2, normalize by range (x - min)/(max - min)
#normalized_df = (df_radar-df_radar.min())/(df_radar.max()-df_radar.min())
normalized_df
To visualize the three normalized variables (positive, negative, and neutral) we'll use radar charts; with the normalized data we can more easily compare the extent of each polarity for the selected character.
index = normalized_df.index.to_list().index(name)
# define figure size
fig, ax = plt.subplots(1, figsize=(8, 8))
# get the fields to a list
categories = list(normalized_df)
N = len(categories)
# Add values to a list and repeat last value to close the triangle
values = normalized_df.values[index].tolist()
values += values[:1]
# calculate the angles and repeat the last value to close the circle
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
# define a subplot axis
ax = plt.subplot(111, polar=True, facecolor='#494949')
# plot x lines and labels (neu, pos, neg)
plt.xticks(angles[:-1], categories, color='black', size=12)
# plot circles and labels (25%, 50%, 75%)
ax.set_rlabel_position(0)
plt.yticks([-1.5, 0, 1.5], ["-1.5", "0", "1.5"],
color="#FFD609",
size=13,
alpha=1)
plt.ylim(-2.75, 2.75)
# Plot data (lines)
ax.plot(angles, values, linewidth=0.6, linestyle='solid', color='black')
# fill area
ax.fill(angles, values, color='#0999FF', alpha=1)
# define title and save pic
plt.title(normalized_df.index[index])
plt.savefig(normalized_df.index[index] + '.png', edgecolor='none')
We can also visualize the distribution of polarity through the episodes. This should allow us to see changes in the character's behavior, as well as outliers that may be worth a closer look.
fig, ax = plt.subplots(1, figsize=(14, 8))
x = np.arange(1, len(myplot['ep_name']) + 1)
# positive
plt.bar(x, myplot.mean_positive, width=1)
# neutral
plt.bar(x,
myplot.mean_neutral,
bottom=myplot.mean_positive,
color='grey',
width=1)
# negative
plt.bar(x,
myplot.mean_negative,
bottom=myplot.mean_positive + myplot.mean_neutral,
color='orange',
width=1)
# chart design details
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.title(name + ' Sentiment by Episode')
In this section, we'll repeat the methods used in '5 - Words Frequency', but this time for a single character. We'll also add a method from spaCy that can help us identify the entities mentioned in the dialogs.
Here we can analyze the most distinguishable terms for a specific character; the font sizes are adjusted so that the more distinguishable the term, the bigger it is displayed.
# print an HTML <h2> tag for each word from the character's distinct terms,
# starting with font-size 60 and reducing it by 4 for each subsequent word
font_size = 60
for i in df_tfidf[name]:
display(
HTML('<h2 style="font-size:' + str(font_size) + 'px";>' + i.upper() +
'</h2>'))
font_size -= 4
Here we build a word cloud with the character's most frequent terms; the cleaned version of the text is used for this visualization.
# build a single string with all the text
char_text = ' '.join(df[df.name == name].clean_txt.astype(str).values)
# build the word cloud
wordcloud = WordCloud(
width=1800,
height=1800,
background_color='#293F3F',
colormap="Reds",
max_font_size=250).generate(char_text.upper())
# adjust and display the figure
fig = plt.figure(figsize=(15, 15), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
Here we'll visualize the most commonly mentioned entities. More specifically, we'll filter the people, organizations, products, locations, and events mentioned in the dialogs, and then count them to find the entities most mentioned by the selected character.
# load model
nlp = en_core_web_sm.load()
# build a single string with all the text and feed it to the spaCy pipeline
doc = nlp('\n'.join(df[df.name == name].text.values))
# get a list of tuples (text, label) for the entities identified in the text
ent_list = [(ent.text, ent.label_) for ent in doc.ents]
# Build a dataframe
col_names = ['name', 'type']
df_ent = pd.DataFrame(ent_list)
df_ent.columns = col_names
df_ent['count'] = np.ones(len(ent_list))
# Filter types, sort the values and display the top 15 entities mentioned
types = ['PERSON', 'ORG', 'PRODUCT', 'LOC', 'EVENT']
df_ent_filtered = df_ent[df_ent['type'].isin(types)]
df_ent_filtered = df_ent_filtered.groupby(col_names).count()
df_ent_filtered = df_ent_filtered.sort_values('count', ascending=False)[:15]
# display top 15 most mentioned entities
df_ent_filtered
Regarding Michael, we can see something common across the word and term frequencies: they're all strongly related to people.
Among the TF-IDF scores, Michael's top 10 most distinguishable words include two pronouns (everybody and somebody) and five names. In the bag-of-words results it's harder to spot patterns, since there are many meaningless words, but we can still see lots of names and people-related pronouns.
The strongest evidence of this is the list of entities most frequently mentioned by Michael: of the 15 entries displayed, only one is not a person, and that exception is actually the name of their city. This suggests that Michael is someone whose biggest interests are people and his community.
https://www.youtube.com/watch?v=vrPgsrfZWOU&feature=youtu.be&t=327
Here we can verify the correlation (Pearson method) between the previously analysed measures and the actual episode ratings.
char_df = build_char_df(name)
# build df with pearson correlation
corr = char_df.corr(method='pearson')
# plot correlation map
fig, ax = plt.subplots(1,figsize = (3,5))
#ax = plt.axes()
happiness_corr = sb.heatmap(corr.iloc[:-1, -1:], vmin=-0.6, vmax=0.6, ax=ax)
fig = happiness_corr.get_figure()
ax.set_title(name)
plt.show()
We can also compare any given variable with the actual ratings; this helps us visualize how closely related those values are.
var = 'total_words'
print('The options are:')
print([
'dialogs', 'mean_sent', 'mean_words', 'mean_positive', 'mean_negative',
'mean_neutral', 'mean_compound', 'total_sent', 'total_words'
])
print('\nSelected variable: ' + var)
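As a quick numeric check, the Pearson correlation between a single variable and the ratings can also be computed directly with SciPy, which additionally returns a p-value. A minimal sketch with invented episode-level numbers (in the notebook these would come from `build_char_df(name)`):

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data standing in for one episode-level variable and the ratings
# (assumption: invented values, not real episode statistics)
total_words = np.array([820, 640, 910, 500, 760, 880])
ratings = np.array([8.4, 7.9, 8.7, 7.5, 8.1, 8.6])

r, p = pearsonr(total_words, ratings)
print(f'r = {r:.3f}, p-value = {p:.4f}')
```

A correlation close to 1 (or -1) with a small p-value suggests the variable moves with (or against) the ratings; the heatmap above shows the same `r` values for all variables at once.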
To better visualize the patterns and movements of the selected variable and the ratings, both were interpolated to fit 50 data points. We lose information by doing so, but trends and the overall direction of the comparison are easier to spot with fewer data points.
fig, ax = plt.subplots(1, figsize=(18, 10))
plt.title(name)
def new_x(x):
#50 represents number of points to make between T.min and T.max
return np.linspace(x.min(), x.max(), 50)
x = np.arange(1, len(myplot['ep_name']) + 1)
spl = make_interp_spline(x, myplot[var], k=3)
power_smooth = spl(new_x(x))
plt.plot(new_x(x), power_smooth, linewidth=2)
plt.ylabel(var)
plt.xlabel('Episode')
plt.legend([var], loc='upper left')
ax2 = ax.twinx()
x = np.arange(0, len(myplot['ratings']))
spl = make_interp_spline(x, myplot['ratings'], k=3)
power_smooth = spl(new_x(x))
plt.plot(new_x(x), power_smooth, color='red', linewidth=2)
plt.ylabel('Rating')
plt.legend(['Ratings'], loc='upper right')
plt.show()
The show has an overall positive sentiment in the interactions it displays; the main characters relate a lot with each other, and the most discussed topics are the relationships between themselves.
Some characters have a stronger relationship than others, like Jim and Pam, who play a romantic pair in the show and have the highest number of interactions among all the main characters.
According to the number of dialogs, episodes, and seasons, and the size of the dialogs, Michael is by far the main character of the show. He's also the character with the strongest correlations between his participation and the episodes' ratings.
Dwight and Andy also had very strong participation in the show. While Dwight's participation was relatively constant throughout the show, Andy seems to have participated a lot more, but over a shorter time range.
The characters also have some sort of group behavior associated with them; we can identify those groups by their similarities in number of dialogs, episodes, and relationship scores.
Those group relationships also appear to mirror the positioning of the characters in The Office. The groups can be identified as:
All of them worked close to each other, and in most groups they all worked in the same department.
We can conclude from this report that The Office is not so much about work in an office environment, but more about the lives of the people who work there. It does so by exploring more personal aspects of the characters, such as their romantic lives, aspirations, families, friends, personal challenges, and discomforts.